Parallel symbolic state-space exploration is difficult, 
but what is the alternative?* 



Gianfranco Ciardo Yang Zhao Xiaoqing Jin 

Department of Computer Science and Engineering 
University of California, Riverside 

{ciardo, zhaoy, jinx}@cs.ucr.edu

State-space exploration is an essential first step in many modeling and analysis problems. Its goal 
is to find and store all the states reachable from the initial state(s) of a discrete-state high-level model 
described, for example, using pseudocode or Petri nets. The state space can then be used to answer 
important questions, such as "Is there a dead state?" and "Can variable n ever become negative?", or as 
the starting point for sophisticated investigations expressed in temporal logic. 

Unfortunately, the state space is often so large that ordinary explicit data structures and sequen- 
tial algorithms simply cannot cope, prompting the exploration of parallel or symbolic approaches. The 
former uses multiple processors, from simple networks of workstations to expensive shared-memory 
supercomputers or, more recently, powerful multicore workstations, to satisfy the large memory and run- 
time requirements. The latter uses decision diagrams to compactly encode the large structured sets and 
relations manipulated during state-space generation. 

Both approaches have merits and limitations. Parallel explicit state-space generation is challenging, 
but close to linear speedup can be achieved, thus its scalability can be quite good; however, the analysis is 
ultimately and obviously limited by the amount of memory and number of processors available overall. 
Symbolic methods rely on the heuristic properties of decision diagrams, which can encode many, but 
by no means all, functions over a structured and exponentially large domain in polynomial space; here 
the pitfalls are subtler, as the performance of symbolic approaches can vary widely depending on the 
particular class of decision diagram chosen, on the order in which the variables describing the state are 
considered, and on many obscure algorithmic parameters. 

In this paper, we survey both approaches. Observing that symbolic approaches are often enormously 
more efficient than explicit ones for many practical models (although it is rarely obvious a priori whether 
this will be the case on a particular application), we argue for the need to parallelize symbolic state-space 
generation algorithms, so that we can realize the advantage of both approaches. Unfortunately, this is a 
very challenging endeavor, as the most efficient symbolic algorithm, Saturation, is inherently sequential. 
We conclude by discussing challenges, efforts, and promising directions toward this goal. 

1 Introduction 

Model checking was introduced almost three decades ago and has gradually been adopted in industrial 
applications. State-space generation forms the base of safety checking and the first step towards more 
complex temporal property checking. In some scenarios, such as VLSI circuits, the potential state space 
is finite; on the other hand, high-level models such as Petri Nets may have an infinite state space. Fur- 
thermore, even when a given Petri Net is bounded, a finite bound on the potential state space is usually 
not known a priori. Hence, state-space generation is an essential and interesting problem. 

The most important metrics to evaluate the effectiveness of state-space generation are memory consumption
and run time. These two metrics are often closely related, but ultimately reflect different
complexity aspects. Considering run time, state-space generation can be expressed as a fixpoint itera- 
tion and, for models with very large diameter (maximum distance from an initial state to any reachable 
state), this iteration can be very time consuming. Considering memory consumption, the state space of 
complex models is often too large to fit in main memory, or even in secondary storage. Even when the 
latter suffices, large memory consumption leads to frequent swaps between main and virtual memory, 
with negative effects on the run time. Fortunately, processors and memory are becoming cheaper at each 
new generation, so that multi-core processors and multi-processor systems provide larger computational 
resources at the same or lower price. The memory bottleneck can be relieved using more memory per 
workstation and multiple workstations, but tackling the long run times is more difficult. While multi- 
processor systems enable the execution of multiple tasks in parallel, it is hard to achieve a speedup linear 
in the number of processors. The difficulty lies in the need to find enough parallel tasks to fully exploit 
the available processors. We believe that speedup will be a fundamental concern for future research on 
parallel formal verification algorithms. 

1.1 Problem setting and notation 

If we ignore the particular high-level formalism used to express our system, we are interested in studying 
a discrete state model fully specified by: 

• A set of states, or potential state space $\mathcal{X}_{pot}$, which describes the "type" of the states.

• A set of initial states $\mathcal{X}_{init} \subseteq \mathcal{X}_{pot}$ from which the system behavior can evolve. Often, there is a
single initial state, $\mathcal{X}_{init} = \{\mathbf{i}_{init}\}$.

• A next-state function $\mathcal{N} : \mathcal{X}_{pot} \to 2^{\mathcal{X}_{pot}}$, which describes the states to which a system can move
in one step. This function can naturally be extended to sets of states, $\mathcal{N}(\mathcal{X}) = \bigcup_{\mathbf{i} \in \mathcal{X}} \mathcal{N}(\mathbf{i})$.

We observe that the model is nondeterministic, unless, for all states $\mathbf{i} \in \mathcal{X}_{pot}$, $|\mathcal{N}(\mathbf{i})| \le 1$, and we say
that state $\mathbf{i}$ is absorbing (or a trap, or dead, or a sink) if $\mathcal{N}(\mathbf{i}) = \emptyset$. In the literature, a transition relation is
sometimes defined instead of the next-state function, but the two carry exactly the same information: the
pair of states $(\mathbf{i}, \mathbf{j})$ is in the transition relation iff $\mathbf{j} \in \mathcal{N}(\mathbf{i})$. Then, we assume that the state is structured:

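• $\mathcal{X}_{pot} = \mathcal{X}_L \times \cdots \times \mathcal{X}_1 = {\times}_{L \ge k \ge 1} \mathcal{X}_k$, so that a (global) state is of the form $\mathbf{i} = (i_L, \ldots, i_1)$, and
$\mathcal{X}_k$ is the (discrete) local state space for submodel $k$ or the local domain for state variable $x_k$.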

The techniques we consider assume that $\mathcal{X}_{pot}$ is finite, thus each $\mathcal{X}_k$ must be finite as well, and we can map
it to $\{0, 1, \ldots, n_k - 1\}$. If $n_k$ is unknown a priori, we can initially let $\mathcal{X}_k = \mathbb{N}$, the set of natural numbers,
and discover the value of $n_k$ later, as we explore the model. Finally, we assume asynchronous behavior,
that is, there is a set $\mathcal{E}$ of events defining a disjunctively-partitioned next-state function:

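• For each event $\alpha \in \mathcal{E}$, $\mathcal{N}_\alpha : \mathcal{X}_{pot} \to 2^{\mathcal{X}_{pot}}$. State $\mathbf{j}$ can be reached by firing $\alpha$ in state $\mathbf{i}$ iff $\mathbf{j} \in \mathcal{N}_\alpha(\mathbf{i})$.

• The overall next-state function is the union of the functions for each event, $\mathcal{N}(\mathbf{i}) = \bigcup_{\alpha \in \mathcal{E}} \mathcal{N}_\alpha(\mathbf{i})$.

• We say that event $\alpha$ is enabled in $\mathbf{i}$ iff $\mathcal{N}_\alpha(\mathbf{i}) \neq \emptyset$, otherwise we say it is disabled.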

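The main goal of our study is then to generate and store the (reachable, or actual) state space $\mathcal{X}_{rch}$
of the model, that is, the smallest subset of $\mathcal{X}_{pot}$ containing $\mathcal{X}_{init}$ and satisfying:

• The recursive definition $\mathbf{i} \in \mathcal{X}_{rch} \wedge \mathbf{j} \in \mathcal{N}(\mathbf{i}) \Rightarrow \mathbf{j} \in \mathcal{X}_{rch}$.

• Or, equivalently, the fixpoint equation $\mathcal{X} = \mathcal{X} \cup \mathcal{N}(\mathcal{X})$.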

The most obvious way to think of the state space is as the limit of the expression

$\mathcal{X}_{init} \cup \mathcal{N}(\mathcal{X}_{init}) \cup \mathcal{N}(\mathcal{N}(\mathcal{X}_{init})) \cup \mathcal{N}(\mathcal{N}(\mathcal{N}(\mathcal{X}_{init}))) \cup \cdots$

but we stress that, while this expression suggests a breadth-first iteration where states at distance $d$ from
$\mathcal{X}_{init}$ are discovered after exactly $d$ iterations (i.e., after $d$ applications of the next-state function $\mathcal{N}$),
this is neither implied by the definition nor is it necessarily the most efficient way to build $\mathcal{X}_{rch}$.
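The breadth-first reading of this expression can be sketched explicitly. The following is a minimal illustration, not the authors' implementation: it stores states in an ordinary Python set, and the toy model (`toy_next`, two counters exchanging three tokens) is a hypothetical example introduced here.

```python
from typing import Callable, Hashable, Iterable, Set


def reachable(init: Iterable[Hashable],
              next_states: Callable[[Hashable], Iterable[Hashable]]) -> Set[Hashable]:
    """Return the smallest set containing `init` and closed under `next_states`."""
    known = set(init)        # states discovered so far
    frontier = set(known)    # states whose successors have not been computed yet
    while frontier:
        candidates = set()
        for state in frontier:
            candidates.update(next_states(state))
        frontier = candidates - known   # keep only the truly new states
        known |= frontier
    return known


# Hypothetical toy model: state (a, b) with a + b == 3; one token moves per step.
def toy_next(state):
    a, b = state
    successors = []
    if a > 0:
        successors.append((a - 1, b + 1))
    if b > 0:
        successors.append((a + 1, b - 1))
    return successors


space = reachable([(3, 0)], toy_next)
```

Iteration $d$ of the loop adds exactly the states at distance $d$ from the initial state, which is the behavior the text warns is suggested, but not required, by the definition.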

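Beyond state-space generation, more advanced analyses can be performed on a discrete-state model.
For example, in the temporal logic CTL [?], the operators EX, EU, and EG are complete, that is, they
can be used to express any CTL operator through complementation, conjunction, and disjunction. If $\mathcal{A}$
and $\mathcal{B}$ are the sets of states satisfying CTL formulae $a$ and $b$, respectively, the set of states satisfying EX$a$
is $\mathcal{X}_{EXa} = \{\mathbf{i} : \exists \mathbf{j} \in \mathcal{A} \cap \mathcal{N}(\mathbf{i})\}$, thus $\mathcal{X}_{EXa} = \mathcal{N}^{-1}(\mathcal{A})$, where $\mathcal{N}^{-1}$ is the backward state function,
i.e., $\mathcal{N}^{-1}(\mathcal{X}) = \{\mathbf{i} : \exists \mathbf{j} \in \mathcal{X}\, (\mathbf{j} \in \mathcal{N}(\mathbf{i}))\}$. The set of states satisfying E$a$U$b$ is instead

$\mathcal{X}_{EaUb} = \{\mathbf{i}^{(0)} : \exists d \ge 0\, (\forall c \in \{0, \ldots, d-1\}\, (\mathbf{i}^{(c)} \in \mathcal{A} \wedge \exists \mathbf{i}^{(c+1)} \in \mathcal{N}(\mathbf{i}^{(c)})) \wedge \mathbf{i}^{(d)} \in \mathcal{B})\}$

and can be characterized as the smallest solution of the fixpoint equation $\mathcal{X} = \mathcal{B} \cup (\mathcal{A} \cap \mathcal{N}^{-1}(\mathcal{X}))$.
Analogously, the set of states satisfying EG$a$ is

$\mathcal{X}_{EGa} = \{\mathbf{i}^{(0)} : \forall d > 0\, (\forall c \in \{0, \ldots, d-1\}\, (\mathbf{i}^{(c)} \in \mathcal{A} \wedge \exists \mathbf{i}^{(c+1)} \in \mathcal{N}(\mathbf{i}^{(c)})))\}$

and can be characterized as the largest solution of the fixpoint equation $\mathcal{X} = \mathcal{A} \cap \mathcal{N}^{-1}(\mathcal{X})$.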
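The three fixpoint characterizations can be sketched over explicit sets. This is only an illustration under stated assumptions: sets are plain Python sets rather than decision diagrams, the backward function is computed by scanning all states, and the four-state graph `succ` is a hypothetical example.

```python
def pre_image(targets, states, next_states):
    """Backward function: states with at least one successor in `targets`."""
    return {i for i in states if any(j in targets for j in next_states(i))}


def ex(a_states, states, next_states):
    """States satisfying EX a."""
    return pre_image(set(a_states), states, next_states)


def eu(a_states, b_states, states, next_states):
    """States satisfying E a U b: least fixpoint of X = B | (A & pre(X))."""
    x = set(b_states)
    while True:
        new = x | (set(a_states) & pre_image(x, states, next_states))
        if new == x:
            return x
        x = new


def eg(a_states, states, next_states):
    """States satisfying EG a: greatest fixpoint of X = A & pre(X)."""
    x = set(a_states)
    while True:
        new = x & pre_image(x, states, next_states)
        if new == x:
            return x
        x = new


# Hypothetical graph: 0 -> 1 -> 2 -> 2 (self-loop), 3 -> 0.
succ = {0: [1], 1: [2], 2: [2], 3: [0]}
states = set(succ)
nxt = succ.__getitem__

ex_result = ex({2}, states, nxt)
eu_result = eu({0, 1}, {2}, states, nxt)
eg_result = eg({1, 2}, states, nxt)
```

The least fixpoint starts from $\mathcal{B}$ and grows, while the greatest fixpoint starts from $\mathcal{A}$ and shrinks; both loops terminate because the state set is finite.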

We focus on state-space generation, but many of the problems faced are analogous to those for the 
computation of these more complex fixpoints, and many of the possible solutions are applicable to them. 

1.2 Explicit vs. implicit techniques, which one should we parallelize? 

Symbolic model checking [?] is undoubtedly a significant breakthrough in formal verification. Instead
of representing states explicitly, symbolic approaches exploit advanced data structures to encode and 
manipulate entire sets of states at once. Paired with binary decision diagrams (BDDs) [?], symbolic
model checkers are able to handle enormous state spaces. At the same time, several questions remain 
open for BDD-based symbolic model checking. One of the most challenging ones is that the evolution of 
the BDDs being manipulated is quite unpredictable during the fixpoint iterations, and it is their peak size, 
often many orders of magnitude larger than the final result being sought, that may exceed the available 
memory and cause the program to fail, making symbolic algorithms brittle and subtle. To alleviate 
this problem, much research has been devoted to static or dynamic variable ordering, quantification 
scheduling, and BDD partitioning, to name a few. After two decades, BDD-based symbolic techniques 
have become mainstream for the verification of synchronous systems, such as VLSI circuits. 

However, state-space generation for asynchronous models such as Petri nets, communicating se- 
quential processes, and process algebras appears more challenging. Although gradually replaced in syn- 
chronous systems, explicit techniques are still competitive for asynchronous systems. Such techniques 
take advantage of the locality and symmetry properties widely enjoyed by asynchronous systems. Partial 
order reduction [24, 43] and symmetry reduction [10] have been successfully implemented in explicit
model checkers to reduce the number of states that must be explored and stored, thus the run time. These 
approaches explore and store only a "representative" subset of the reachable states, but are neverthe- 
less able to answer the same questions as an exhaustive search. Moreover, low-level memory reduction 
techniques such as hash compaction are widely employed to further reduce memory requirements. 
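
SymbolicBfSsGen($\mathcal{X}_{init}$, $\mathcal{N}$) is
1  $\mathcal{Y} \leftarrow \mathcal{X}_{init}$;   • known states
2  $\mathcal{U} \leftarrow \mathcal{X}_{init}$;   • unexplored states
3  while $\mathcal{U} \neq \emptyset$ do
4    $\mathcal{W} \leftarrow \mathcal{N}(\mathcal{U})$;   • potentially new states
5    $\mathcal{U} \leftarrow \mathcal{W} \setminus \mathcal{Y}$;   • truly new states
6    $\mathcal{Y} \leftarrow \mathcal{Y} \cup \mathcal{U}$;
7  return $\mathcal{Y}$;

Figure 1: The traditional breadth-first symbolic state-space generation.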

The need to choose between explicit and implicit techniques, then, arises mainly for the verification 
and analysis of asynchronous systems, as many factors can reduce the effectiveness of symbolic algo- 
rithms when applied to asynchronous models. Traditional BDD techniques can efficiently encode com- 
plex synchronous transition relations, but they are often not as compact when applied to asynchronous 
events, even if disjunctive partitioning [9] often helps. Analogously, image computation, the base of symbolic
state-space generation, is often expensive for asynchronous systems. Thus, tools like SPIN [28], a
model checker with advanced explicit techniques, have achieved wide acceptance and success in industrial
protocol verification. Furthermore, as shown in the following sections, the parallelization of explicit
algorithms is much more successful than that of symbolic ones, in the sense that parallel implementations
of explicit techniques can achieve almost linear speedup.

While it appears that the parallelization of explicit techniques is more promising than that of symbolic
ones, this paper argues that symbolic techniques nevertheless often have the best chance to shine, if not
in speedup, certainly in ultimate performance. The most traditional approach to symbolic state-space
generation is shown in Fig. 1, where all sets, i.e., $\mathcal{Y}$, $\mathcal{U}$, and $\mathcal{W}$, and relations, i.e., $\mathcal{N}$, are encoded
using BDDs (if the local state variable $x_k$ is not boolean, we can use $\lceil \log_2 n_k \rceil$ boolean variables for it,
or we can use MDDs, in particular extensible MDDs [?] if $\mathcal{X}_k$ is not known a priori; in the following,
we simply use the term DD). This is essentially a symbolic breadth-first exploration, where iteration $d$
discovers all states at distance $d$ from $\mathcal{X}_{init}$. While this approach is often very effective, our confidence
in the appropriateness of symbolic techniques mainly comes from an even better algorithm: Saturation.

Initially defined for asynchronous models satisfying Kronecker-consistency [11], where, for each
event $\alpha$, $\mathcal{N}_\alpha$ is the conjunction of $L$ "local" functions $\mathcal{N}_{\alpha,k} : \mathcal{X}_k \to 2^{\mathcal{X}_k}$, for $L \ge k \ge 1$, Saturation has later
been extended to a fully general setting where each $\mathcal{N}_\alpha$ is the conjunction of some number of functions,
each depending on and affecting a subset of the state variables [20]. The main idea behind Saturation is that
this disjunctive-then-conjunctive decomposition highlights the locality of each event $\alpha$, so that we can
define $Top(\alpha)$ to be the highest local state variable on which the enabling or the effect of $\alpha$ depends, or
which $\alpha$ affects. Then, we build the DD encoding $\mathcal{X}_{init}$ and saturate its nodes bottom up, applying, for
$k$ from 1 to $L$, all events $\alpha$ with $Top(\alpha) = k$ to it, exhaustively, until no more states are discovered, with
the proviso that, whenever saturating a DD node $p$ at level $k$ causes the creation of a DD node $q$ at a level
below $k$, we saturate node $q$ before completing the saturation of the higher node $p$. Thus, when saturated,
DD node $p$ encodes a fixpoint with respect to $\mathcal{N}_{\le k} = \bigcup_{\alpha : Top(\alpha) \le k} \mathcal{N}_\alpha$ and, when we saturate the root node
at level $L$, we have the entire state space, that is, the fixpoint of $\mathcal{X} = \mathcal{X} \cup \mathcal{N}(\mathcal{X})$. The result is a much
more efficient state-space generation algorithm, often requiring many orders of magnitude less memory
and run time than symbolic breadth-first iterations. Indeed, we have also applied the Saturation approach
to CTL model checking [47], distance function computation [19], and timed reachability in integer-timed
models [14], but here we will limit our discussion to state-space generation.
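The bookkeeping that drives the bottom-up order, grouping events by $Top(\alpha)$, can be sketched as follows. This is not Saturation itself, only the event classification it relies on; the event names and their variable sets are hypothetical, and levels are numbered 1 to $L$ as in the text.

```python
def events_by_level(events, num_levels):
    """Group events by Top(event), the highest variable index each one touches.

    `events` maps an event name to the set of variable indices (1..num_levels)
    that the event reads or writes.
    """
    by_level = {k: [] for k in range(1, num_levels + 1)}
    for name, variables in events.items():
        by_level[max(variables)].append(name)
    return by_level


# Hypothetical model with L = 3 state variables and four events.
events = {"a": {1}, "b": {1, 2}, "c": {3}, "d": {2, 3}}
groups = events_by_level(events, 3)
```

With this grouping, a node at level $k$ only needs the events in `groups[1]` through `groups[k]`, which is what makes the fixpoint at each node local.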

1.3 Classification of previous work on parallel state-space analysis

Modern technology offers new parallel platforms for both explicit and implicit techniques. In general, 
there are two methodologies for parallelization: data decomposition and functional decomposition. Data 
decomposition distributes the data to be processed across parallel tasks, each of which executes on a 
different workstation. Functional decomposition exploits the parallelism within a function computed by 
an application, allowing the distribution of computation over multiple processors or cores. 

Parallel DD-based algorithms have been developed for a variety of platforms: shared memory multi-processor
or multi-core systems [41], networks of workstations (NOW) [34], distributed shared memory
(DSM) architectures [37], single-instruction-multiple-data (SIMD) and multiple-instruction-multiple-data
(MIMD) architectures [23], and vector processors [36]. According to the nature of the state-space
generation algorithm, we can generally classify the implementation platform into two categories:
distributed-memory architecture vs. shared-memory architecture. The former, e.g., NOW or PC clusters, 
has the advantage of possessing abundant resources to handle large systems. The latter, e.g., multi- 
processor multi-core systems, is becoming the predominant technology trend. We omit detailed discus- 
sion of these platforms, and focus on their characteristic features and challenges. For distributed-memory 
systems, the main considerations are how to distribute data, maintain load balance, and reduce commu- 
nication overhead and latency. For shared-memory systems, the mechanism employed to guarantee 
mutually exclusive access to a memory region, load balance, and task scheduling are instead paramount. 
Most literature over the last 25 years has been devoted to algorithms on distributed-memory systems due 
to their ability to overcome memory constraints and to the pervasiveness of computer networks. The 
advent of inexpensive memory and multi-core systems has reignited interest in shared-memory systems. 

With respect to an orthogonal classification that takes into consideration not the hardware architecture 
but the type of data structure employed by the state-space generation algorithm, two main approaches 
exist. The traditional explicit state-space generation approach, such as the one used in the model-checking
tool SPIN [28], enumerates and explores each state one by one, and was first parallelized in [16, 35, 40].
The other approach, DD-based state-space generation, is behind all commonly used symbolic model-checking
tools, such as SMV [30] and SMART [18].

The remainder of this paper is organized as follows. Section 2 focuses on the explicit distributed-memory
approaches proposed in [?, 3, 32, 40] and the explicit shared-memory methods presented in [?, 5, 29].
Section 3 surveys parallel symbolic approaches for distributed-memory architectures
[?, ?, ?, 25, 26, 27, 31, ?, 42, 41], as well as the shared-memory approach of [?]. Section 4 discusses
some promising directions for further research on optimizing the parallelization of symbolic state-space
generation algorithms.

2 Parallelizing explicit state-space generation 

Most work in parallel explicit state-space generation has focused on distributed-memory approaches that 
utilize inexpensive NOWs, although shared memory approaches have also been explored. 

2.1 Distributed-memory approaches for explicit state-space generation 

As memory consumption is the main bottleneck for explicit techniques, being able to exploit the overall 
storage available on a NOW is an appealing idea. Most approaches along these lines follow the general 
framework of Fig. 2. Assuming that there are $N$ workstations indexed with an identifier $w$ ranging from 1
to $N$, a function $\lambda : \mathcal{X}_{pot} \to \{1, \ldots, N\}$ is used to partition the potential state space $\mathcal{X}_{pot}$ into $N$ classes, that
is, assign an "owner" workstation to each state. Then, each workstation performs essentially the same
steps as those required for sequential explicit state-space generation, except that each state $\mathbf{j}$ reachable
from the state $\mathbf{i}$ currently being explored is checked first to see if it belongs to a different workstation, in
which case $\mathbf{j}$ is sent to it, not processed locally.

Figure 2: General framework for distributed explicit state-space generation.
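The framework just described can be sketched as a single-process simulation. This is an illustration under stated assumptions, not the algorithm of the figure: workstations are simulated round-robin, message passing is modeled with in-memory queues, workstations are indexed from 0 rather than 1, and `owner` and `toy_next` are hypothetical.

```python
from collections import deque


def distributed_explore(init, next_states, owner, n_workers):
    """Simulate N workstations; `owner(state)` returns an index in 0..N-1."""
    known = [set() for _ in range(n_workers)]     # per-workstation state store
    inbox = [deque() for _ in range(n_workers)]   # states addressed to each workstation
    messages = 0                                  # number of cross-workstation sends
    for state in init:
        inbox[owner(state)].append(state)
    while any(inbox):
        for w in range(n_workers):
            while inbox[w]:
                i = inbox[w].popleft()
                if i in known[w]:
                    continue                      # duplicate, already stored here
                known[w].add(i)
                for j in next_states(i):
                    v = owner(j)
                    if v != w:
                        messages += 1             # j belongs to another workstation
                    inbox[v].append(j)
    return known, messages


# Hypothetical toy model: state (a, b) with a + b == 3; one token moves per step.
def toy_next(state):
    a, b = state
    successors = []
    if a > 0:
        successors.append((a - 1, b + 1))
    if b > 0:
        successors.append((a + 1, b - 1))
    return successors


known, messages = distributed_explore([(3, 0)], toy_next, lambda s: s[0] % 2, 2)
```

In this toy run every transition changes the parity of the first component, so every successor is sent to the other workstation: a deliberately poor choice of partitioning function that makes the communication cost visible.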

In this framework, several important factors affect performance. First and foremost, the state-space 
partitioning function $\lambda$ has great impact on both workload and memory balance. Second and almost as
important, the communication overhead and latency must be taken into account when deciding how and 
when to exchange states. Specifically, we need to decide whether states belonging to other workstations 
are sent immediately after being discovered, or if they are buffered into larger messages and, in this case, 
how large the message buffers should be. Then, we need to decide the frequency at which a workstation 
checks for incoming states sent to it, as waiting too long to receive these states can cause incoming buffers 
to grow too large, possibly with many duplicate states in them. The goal of tuning these parameters is 
then to keep all workstations busy most of the time, while attempting to achieve similar proportions of 
memory usage in each workstation and minimizing the number of message exchanges. 

Of course, these factors might be in conflict. Obviously, a partitioning function where all states belong
to one workstation has no communication at all, but the worst workload and memory balance. More
interestingly, a perfect hash function for $\lambda$ will instead achieve excellent workload and memory balance,
but also maximize communication for state exchanges, as the probability that any state $\mathbf{j}$ reachable from
$\mathbf{i}$ belongs to the same workstation as $\mathbf{i}$ is only $1/N$. Thus, a good choice of $\lambda$ should still achieve a good
workload and memory balance, but at the same time guarantee that most state-to-state transitions remain
within the same workstation, thus require no communication. A hash function is often used to define $\lambda$,
and an approach to achieve a good compromise was discussed in [16] for the case of Petri nets, where,
by hashing the state on just a few of its components (the number of tokens in only a few of the Petri net
places), we ensure that any Petri net transition not affecting those places will result in states belonging to
the same workstation. However, even employing this idea, it is possible to define $\lambda$ so that the mapping
of reachable states (as opposed to potential states) is highly uneven.
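The trade-off can be made concrete by measuring, for a given partitioning function, the fraction of transitions that cross workstations and the number of states each workstation owns. The sketch below is only an illustration: the three-place token model and both partitioning functions are hypothetical, and workstations are indexed from 0.

```python
def explore(init, next_states):
    """Plain sequential reachability, used here only to obtain the state set."""
    known, frontier = set(init), list(init)
    while frontier:
        state = frontier.pop()
        for j in next_states(state):
            if j not in known:
                known.add(j)
                frontier.append(j)
    return known


def token_moves(state):
    """Hypothetical net: tokens move between adjacent places p0 <-> p1 <-> p2."""
    a, b, c = state
    successors = []
    if a > 0:
        successors.append((a - 1, b + 1, c))   # p0 -> p1
    if b > 0:
        successors.append((a + 1, b - 1, c))   # p1 -> p0
        successors.append((a, b - 1, c + 1))   # p1 -> p2
    if c > 0:
        successors.append((a, b + 1, c - 1))   # p2 -> p1
    return successors


def partition_stats(states, next_states, owner, n_workers):
    """Return (fraction of cross-workstation transitions, states per workstation)."""
    sizes = [0] * n_workers
    cross = total = 0
    for i in states:
        sizes[owner(i)] += 1
        for j in next_states(i):
            total += 1
            cross += owner(i) != owner(j)
    return cross / total, sizes


states = explore([(4, 0, 0)], token_moves)
by_one_place = lambda s: s[0] % 2             # depends on place p0 only
by_two_places = lambda s: (s[0] + s[2]) % 2   # depends on p0 and p2
frac_one, sizes_one = partition_stats(states, token_moves, by_one_place, 2)
frac_two, sizes_two = partition_stats(states, token_moves, by_two_places, 2)
```

Both functions give the same memory balance on this model, but the one that looks at a single place keeps every p1/p2 transition local, while the other turns every transition into a message.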

We proposed a completely different way to define $\lambda$ in [35], by organizing the states in a search
tree (a data structure commonly used in explicit state-space exploration anyway), in such a way that the
top few levels of the tree are duplicated in each workstation, while the subtrees at the lower levels are
mapped to individual workstations with no duplication. If the shared portion of the tree has $M \gg N$
terminals, each corresponding to a non-shared subtree, the approach simply requires associating the
index of the owner workstation with each of these terminals. As the search progresses, each workstation
keeps track of the sizes of the subtrees it owns, and, if a workstation is overloaded, it can restore memory
balance by reassigning some of its subtrees to lightly loaded workstations. The overhead to rebalance
is clearly much lower than with a hash function, which requires reallocating all the states discovered
so far if it is modified. Fig. 3 illustrates this idea, where the states actually stored by workstation $w$,
for $w = 1, 2, 3, 4$, are shown in the four quadrants. When workstation 1, 2, or 4 searches for state $s$, with
$\lambda(s) = 3$, it reaches the leaf storing state $k$ and learns that workstation 3 owns the subtree rooted at $k$.
When workstation 3 searches for state $s$, it instead reaches the node storing state $k$ and continues the
search, until it finds $s$ in that subtree, as in the figure, or determines that state $s$ must be inserted in that
subtree. The shared top portion of the tree represents a small overhead, since a value for $M$ on
the order of 10 to 100 times $N$ is large enough in practice, and this is negligible with respect to the number
of reachable states in practical applications. Only minimal synchronization is required, initially to agree
on the structure of the top portion of the search tree and the assignment of its leaves to workstations, and,
occasionally during state-space generation, to agree on a different subtree-to-workstation mapping, and
broadcast this change. The asynchronous communication between pairs of workstations is then limited,
as before, to sending newly found states determined to belong to another workstation (which again can
be reduced if the top portion of the search tree is such that the search is determined by just a few local
state components), and to exchanging subtrees when load balancing is required.
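The idea of a replicated top with $M$ terminals, each tagged with an owner, can be sketched with a sorted list of split keys standing in for the shared tree levels. This is a simplification under stated assumptions, not the data structure of [35]: the class `SharedTreeTop`, its rebalancing rule (move the largest subtree that still narrows the gap), the integer states, and the 0-based workstation indices are all hypothetical.

```python
import bisect


class SharedTreeTop:
    """Replicated top of the search tree: M leaves, each mapped to an owner."""

    def __init__(self, split_keys, n_workers):
        self.split_keys = sorted(split_keys)          # M - 1 boundaries give M leaves
        self.n_workers = n_workers
        n_leaves = len(self.split_keys) + 1
        self.leaf_owner = [leaf % n_workers for leaf in range(n_leaves)]
        self.leaf_size = [0] * n_leaves               # states stored under each leaf

    def leaf(self, state):
        return bisect.bisect_right(self.split_keys, state)

    def owner(self, state):
        return self.leaf_owner[self.leaf(state)]

    def record_insert(self, state):
        self.leaf_size[self.leaf(state)] += 1

    def loads(self):
        loads = [0] * self.n_workers
        for leaf, w in enumerate(self.leaf_owner):
            loads[w] += self.leaf_size[leaf]
        return loads

    def rebalance(self):
        """Move one subtree from the heaviest to the lightest workstation."""
        loads = self.loads()
        heavy = max(range(self.n_workers), key=loads.__getitem__)
        light = min(range(self.n_workers), key=loads.__getitem__)
        gap = loads[heavy] - loads[light]
        # Moving a subtree of size s changes the gap to gap - 2s; require s < gap.
        movable = [leaf for leaf, w in enumerate(self.leaf_owner)
                   if w == heavy and 0 < self.leaf_size[leaf] < gap]
        if not movable:
            return None
        leaf = max(movable, key=self.leaf_size.__getitem__)
        self.leaf_owner[leaf] = light                 # only this mapping is broadcast
        return leaf, heavy, light


top = SharedTreeTop(split_keys=[10, 20, 30, 40, 50], n_workers=4)   # M = 6, N = 4
for s in list(range(10)) + list(range(40, 45)) + [15]:
    top.record_insert(s)
before = top.loads()
moved = top.rebalance()
after = top.loads()
```

Rebalancing changes a single entry of `leaf_owner`, so only that mapping needs to be agreed upon and broadcast, whereas changing a hash function would relocate states everywhere.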

Both [16], which requires the user to explicitly provide a partitioning function, and the more automated
tree-based approach of [35] achieve close to linear speedups, as well as excellent memory load
balance on conventional distributed memory architectures.

In [40], the explicit model checker Murφ is parallelized on a NOW. With the support of a fast message
passing scheme, active messages, each process runs asynchronously without global synchronization. A
universal hash function is used to determine the workstation to which a state belongs, and the property
of this hash function guarantees that states are evenly distributed among the workstations. The parallel
version of Murφ achieves close to linear speedup. The tool SPIN was also parallelized [32], but the
focus was not on speedup, rather on the ability to handle large models otherwise intractable. In this
work, communication becomes a dominant factor compared to the time to compute successor states. To
minimize communication, the partition function $\lambda$ depends on one state component. As already shown
in [16], this reduces cross-transitions between processes. The parallel version of SPIN retains the most
important memory and complexity reduction techniques employed by the sequential version.

Beyond state-space generation, [3] proposed a distributed algorithm for LTL model checking, building
upon a parallel algorithm for accepting cycle detection in Büchi automata [2]. Sequential solutions
to this problem rely on depth-first search (DFS), which is hard to parallelize. The basic idea of this work
is then to detect back-level edges, i.e., edges $(\mathbf{i}, \mathbf{j})$ where the distance of state $\mathbf{i}$ from the initial state is
greater than the distance of state $\mathbf{j}$ from the initial state. Parallel breadth-first search (BFS) is employed
to detect back-level edges. After each BFS step, workstations synchronize and detect back-level edges.
DFS is then employed on each workstation in parallel to find cycles. Techniques are employed to reduce
state revisiting, and partial order reduction can be combined with this distributed algorithm. This parallel
scheme falls into the basic framework discussed at the beginning of this section, showing that this
framework is applicable not only to reachability analysis, but also to more complicated model checking.

Figure 3: A tree-based partitioning function (N = 4, M = 6).

In conclusion, explicit distributed-memory algorithms mainly focus on how to maintain load balance,
using static strategies such as a good hash function or our partially-shared search tree, and try to minimize
communication by defining a partition function that exploits knowledge of the state structure.

2.2 Shared-memory approaches for explicit state-space generation 

With respect to explicit shared-memory approaches, [?] "port" the parallel algorithm from a distributed-memory
to a shared-memory architecture, in conjunction with various techniques to improve real-world
performance for LTL model checking. First, lightweight threads play the role of workstations in the
NOW framework, and mutual exclusion techniques are used to prevent data races. Then, a two-level
lock algorithm is used to reduce synchronization overhead, essentially changing the lock granularity
of the data structure. Furthermore, FIFO queues handle message passing between threads, to reduce
communication overhead, and are also employed to solve memory allocation issues. Experimental results
show that the implementation scales up to 16 cores and has better performance than the MPI version.
[?] concludes that the main bottleneck is the state generator and proposes, as future work, balancing the
performance of the state generator to improve the scalability of the entire algorithm.

Another relevant work is [29], which provides another algorithm for reachability analysis in the context
of CTL* model checking. A work-stealing two-queue structure is used to dynamically maintain load
balance during state exploration, with low synchronization overhead. Each process has an unbounded
shared queue and a bounded private queue to store unexplored states. A process is allowed to add and
remove states in its own queues, but it can also remove states from the shared queue of other processes,
thus it can steal another process's work instead of going idle. Of course, shared queues must be guarded
by a lock, introducing some synchronization overhead in exchange for good load balance. Experimental
results show almost linear speedups up to 12 to 16 processors, depending on the state space size; beyond
that, the benefits of using more processors are offset by synchronization and scheduling overheads.
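The two-queue discipline can be sketched as a deterministic, single-threaded simulation. This is not the implementation of [29]: processes take turns instead of running concurrently, so the lock guarding each shared queue is omitted, and the single visited table, the private-queue capacity, and the toy model are assumptions made for the example.

```python
from collections import deque


def two_queue_explore(init, next_states, n_procs, private_cap=2):
    """Round-robin simulation of work stealing with private and shared queues."""
    private = [deque() for _ in range(n_procs)]   # bounded, used only by the owner
    shared = [deque() for _ in range(n_procs)]    # unbounded, other processes may steal
    known = set()                                 # assumed global visited table
    steals = 0

    def push(p, state):
        if len(private[p]) < private_cap:
            private[p].append(state)
        else:
            shared[p].append(state)               # overflow becomes stealable work

    for state in init:
        known.add(state)
        push(0, state)

    while any(private) or any(shared):
        for p in range(n_procs):
            if private[p]:
                state = private[p].popleft()
            elif shared[p]:
                state = shared[p].popleft()
            else:
                victims = [q for q in range(n_procs) if q != p and shared[q]]
                if not victims:
                    continue                      # nothing to steal this turn
                state = shared[victims[0]].popleft()
                steals += 1
            for j in next_states(state):
                if j not in known:
                    known.add(j)
                    push(p, j)
    return known, steals


def token_moves(state):
    """Hypothetical net: tokens move between adjacent places p0 <-> p1 <-> p2."""
    a, b, c = state
    successors = []
    if a > 0:
        successors.append((a - 1, b + 1, c))
    if b > 0:
        successors.append((a + 1, b - 1, c))
        successors.append((a, b - 1, c + 1))
    if c > 0:
        successors.append((a, b + 1, c - 1))
    return successors


found, steals = two_queue_explore([(4, 0, 0)], token_moves, n_procs=3)
```

In a real multi-threaded version, only operations on the shared queues would need locking; work in the private queue stays lock-free, which is where the low synchronization overhead comes from.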

This observation leads us to the second part of the paper: explicit approaches for state-space generation
exhibit good scalability on distributed and parallel systems, but only up to some point, beyond which 
further increasing their performance becomes very difficult. Since symbolic approaches tend to work 
very well even just in sequential implementations, why not attempt to parallelize them? 

3 Parallelizing symbolic state-space generation

For symbolic approaches, the first and foremost problem is to identify a set of parallel "tasks", which is 
a challenge as most symbolic operations are recursive and inherently sequential. Still, there is a large 
amount of research attacking this problem from different angles. Parallelism can be realized through low- 
level symbolic operations, such as logic conjunction or disjunction operations in DDs, or through high- 
level algorithms, such as BFS or Saturation. On distributed-memory systems, symbolic DD structures 
can be stored in "vertical" or "horizontal" partitions, as we will discuss in Section 3.1.

3.1 Distributed-memory approaches for symbolic state-space generation 

One way to achieve parallelization for symbolic techniques is designing parallel BDD libraries for a
NOW environment. [34, 38, 42] discuss how to store and manipulate BDDs on a NOW. These approaches
achieve parallelism at the level of low-level DD operations and, if they succeed in providing an interface
similar to that provided by an ordinary sequential DD library, they can be applied to traditional DD-based
algorithms without requiring them to be rewritten. As these packages are mostly applied to benchmark
synchronous circuits, their performance on asynchronous models is unknown.

At a higher level, parallelism can be attained through a state-space partitioning approach similar to
the one we discussed in Section 2.1. This symbolic strategy, which we call vertical partitioning, assumes
a partition of the potential state space $\mathcal{X}_{pot}$ into a set of $N$ "windows" $\{W_1, \ldots, W_N\}$. These are in fact
exactly analogous to the function $\lambda$ required for the distributed explicit approach: we can simply think
of $W_w$ as $\{\mathbf{i} \in \mathcal{X}_{pot} : \lambda(\mathbf{i}) = w\}$. In practice, we require that the sets $W_w$ be easily encoded as DDs, thus
they usually depend on just a few variables (interestingly, this also bears similarity with the requirements
for a good choice of $\lambda$). The approach, shown in Fig. 4, is similar in spirit to the explicit one of Fig. 2,
except that all operations are performed symbolically on DDs, not on individual states. When workstation $w$
explores its set of unexplored states $\mathcal{U}_w$ by applying the next-state function $\mathcal{N}$, it finds both states that
it owns (states in $W_w$) and states that belong to other workstations $w'$ (states in $W_{w'}$); the latter are sent to
the appropriate workstations, encoded as DDs.
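The set-level structure of this scheme can be sketched with Python sets standing in for DDs. This is an illustration only: a real implementation would intersect DDs with the window DDs and ship the results between machines, whereas here the workstations are simulated in one process; the `image` function and the two windows are hypothetical, and workstations are indexed from 0.

```python
def vertical_partition_explore(init, image, windows):
    """Set-based stand-in for vertical partitioning.

    `image(S)` applies the next-state function to a whole set;
    `windows[w](state)` is True iff workstation w owns the state.
    """
    n = len(windows)
    unexplored = [{s for s in init if windows[w](s)} for w in range(n)]
    known = [set(u) for u in unexplored]
    while any(unexplored):
        outgoing = [set() for _ in range(n)]
        for w in range(n):
            img = image(unexplored[w])             # one image computation per workstation
            for v in range(n):
                outgoing[v] |= {s for s in img if windows[v](s)}   # img restricted to W_v
        for v in range(n):
            unexplored[v] = outgoing[v] - known[v]  # each owner removes what it already has
            known[v] |= unexplored[v]
    return known


def image(states):
    """Hypothetical model: state (a, b) with a + b == 3; one token moves per step."""
    out = set()
    for a, b in states:
        if a > 0:
            out.add((a - 1, b + 1))
        if b > 0:
            out.add((a + 1, b - 1))
    return out


windows = [lambda s: s[0] < 2, lambda s: s[0] >= 2]
slices = vertical_partition_explore({(3, 0)}, image, windows)
```

Because each window here depends on a single state component, the corresponding DD would be small, which matches the requirement that windows be easy to encode.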

Just as in the explicit case, the choice of partition is critical, and even more difficult: 

• A balanced partition of the potential states $\mathcal{X}_{pot}$ does not imply a balanced partition of
the reachable states $\mathcal{X}_{rch}$, yet the slicing windows are defined on the potential state space.

• A balanced partition of the reachable states $\mathcal{X}_{rch}$ does not imply a balanced number of DD
nodes, since the number of states encoded by a DD is not directly related to the number of DD nodes.

• Even if $\{W_1, \ldots, W_N\}$ is a partition of the potential states, and thus results in a partition of
the reachable states, this does not imply absence of DD node duplication (top of Fig. 5). Indeed, the
minimum amount of DD node duplication occurs when the DD is a tree, which is generally the worst case
for the application of symbolic approaches (bottom of Fig. 5).

In summary, the goal should be to minimize the sizes of the DDs managed by the N workstations (i.e.,
minimize the size of the largest DD, or the sum of the DD sizes), but the vertical partitioning
approach, being after all based on partitioning states and not DD nodes, might fail to achieve this goal.

Figure 4: Distributed symbolic state-space generation using vertical slicing.

Figure 5: Problems with vertical partitioning.

Nevertheless, some success has been demonstrated with this approach through intelligent dynamic
workload balancing, which essentially reduces to intelligent choices of windows (and reassessments of
these choices). In [27], a slicing method is proposed to achieve balanced slices that, with the help of
a "cost function", also keeps the number of duplicated DD nodes low after a re-slicing. In [26], a more
advanced strategy is proposed to reallocate tasks to free processes when necessary and to release a
process when its workload is small. This work-efficient approach attempts to adaptively minimize the
number of workstations employed at any one point during the fixed-point iterations: we start with one
workstation and begin state-space generation, then we periodically reassess the situation and, if all
workstations have an excessive memory load, we increase N and use a finer slicing, or, if they all have
a very light memory load, we decrease N and use a coarser slicing. This optimizes workstation
utilization and reduces communication, so it is a good idea (indeed, it could and should also be
employed in conjunction with the horizontal partitioning we describe next). However, this is just a
confirmation that achieving true speedup through parallelization is hard for symbolic methods.
Maximizing utilization is not the goal, otherwise we would simply use just one workstation! In practice,
a fairly large number of workstations might be available, whether we use them or not, so the real goal
is reducing the run time for a given memory footprint, or reducing the memory footprint for a given
run time, which is much harder.
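
The periodic reassessment described above can be sketched as a small decision rule. This is only an
illustration: the thresholds high and low, the cap n_max, and the choice of doubling or halving N are
assumptions made here, not details taken from [26].

```python
from typing import List


def next_worker_count(loads: List[float], high: float, low: float, n_max: int) -> int:
    """Decide the number of workstations N after one periodic reassessment.

    loads[w] is the current memory load of workstation w, in the same unit as the thresholds.
    """
    n = len(loads)
    if all(load > high for load in loads) and n < n_max:
        return min(2 * n, n_max)  # all overloaded: finer slicing (doubling is an assumption)
    if all(load < low for load in loads) and n > 1:
        return max(n // 2, 1)     # all lightly loaded: coarser slicing (halving is an assumption)
    return n                      # mixed situation: keep the current slicing
```

With thresholds of 80 and 20, two workstations loaded at 90 and 95 would trigger a finer slicing, while
loads of 90 and 10 leave N unchanged and call for rebalancing the windows instead.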

Further improvement can be achieved using asynchronous DD exchanges. [26] observed that "the
fact that the reachability computation is synchronized in a step-by-step fashion has a major impact on
the computation time". Due to the need to exchange non-owned states, processes synchronize after each
image computation, and the slowest one determines the speed of the overall computation. To overcome
this drawback, a fully asynchronous distributed algorithm is proposed in [25], where processes do not
synchronize at each iteration, but instead run concurrently without waiting for each other. The classic
two-phase Dijkstra algorithm is then adapted to this framework, where the number of processes can vary
dynamically. An "early split" is introduced to utilize free processes and achieve speedup. Compared to
the previous approach of [26], improvements of up to a factor of ten are reported for some large circuits.

We now move to consider distributed approaches for the Saturation algorithm. Saturation executes
in a node-wise fashion, instead of using heavy global breadth-first iterations, but this also means that
Saturation follows a strict order when firing events on nodes: a node is saturated only after all of its
children have been saturated. This policy is quite efficient but difficult to parallelize.

In [11] we adopted horizontal partitioning [38] to distribute the Saturation algorithm. Assuming that
the number of levels L is at least as large as the number of workstations N, and hopefully much larger,
MDD nodes are distributed to workstations according to their level: workstation w owns a contiguous
range of levels $\mathcal{L}_w = \{mytop_w, \ldots, mybot_w\}$, so that $\{\mathcal{L}_N, \ldots, \mathcal{L}_1\}$
constitute a partition of the set of levels $\{L, \ldots, 1\}$ (see Fig. 6). Since Saturation fires event
$\alpha$ starting at level $Top(\alpha)$, such an arrangement allows the appropriate workstation to start
firing an event and, if the recursive firing reaches a boundary level, the workstation simply issues a
request to continue the operation in the workstation responsible for the next set of levels below, and
goes idle, waiting for a reply. The use of quasi-reduced MDDs simplifies the implementation, since it
naturally allows us to associate a unique table $UT_k$ and an operation cache $OC_k$ to each level k.
Then, workstation w stores and manages $\{UT_{mytop_w}, \ldots, UT_{mybot_w}\}$ and
$\{OC_{mytop_w+1}, \ldots, OC_{mybot_w+1}\}$. The advantage of horizontal partitioning is clear: absolutely
no duplication of nodes or cache entries. Furthermore, memory balancing simply requires reallocating
levels by changing boundaries across neighboring workstations and moving the corresponding nodes and
cache entries, and it is easy to calculate, before performing such an exchange, what the new memory
load will be. However, as stated, this approach is completely sequential: at any one time, exactly one
workstation is performing work, while the others are idle, waiting for results or to start the saturation
of nodes at their levels. Thus, any speedup is due to being able to exploit the overall memory of a NOW,
and is observed only when comparing with sequential Saturation running with insufficient memory.

Figure 6: Horizontal partition used in distributed Saturation.
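
The level-based ownership and the "predict before you move" property of horizontal partitioning can be
sketched as follows. This is a minimal illustration under assumptions made here: workstations are
numbered from 0 starting with the one that owns the topmost levels, memory load is approximated by the
number of nodes per level, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

LevelRange = Tuple[int, int]  # (mytop, mybot), with mytop >= mybot


def contiguous_level_ranges(num_levels: int, num_workers: int) -> List[LevelRange]:
    """Split levels L..1 into contiguous ranges, topmost range first."""
    if num_levels < num_workers:
        raise ValueError("need at least one level per workstation")
    base, extra = divmod(num_levels, num_workers)
    ranges, top = [], num_levels
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)
        ranges.append((top, top - size + 1))
        top -= size
    return ranges


def predicted_loads(nodes_per_level: Dict[int, int], ranges: List[LevelRange]) -> List[int]:
    """Node count each workstation would hold under the given level ranges."""
    return [sum(nodes_per_level[k] for k in range(bot, top + 1)) for top, bot in ranges]


def shift_boundary_down(ranges: List[LevelRange], w: int) -> List[LevelRange]:
    """Workstation w takes over the top level of its lower neighbor w + 1."""
    (top_w, bot_w), (top_n, bot_n) = ranges[w], ranges[w + 1]
    if top_n == bot_n:
        raise ValueError("the neighbor would be left without levels")
    moved = list(ranges)
    moved[w] = (top_w, bot_w - 1)
    moved[w + 1] = (top_n - 1, bot_n)
    return moved


ranges = contiguous_level_ranges(6, 3)
nodes_per_level = {6: 1, 5: 3, 4: 10, 3: 40, 2: 20, 1: 5}
before = predicted_loads(nodes_per_level, ranges)
after = predicted_loads(nodes_per_level, shift_boundary_down(ranges, 0))
```

Because no node is shared between levels, the loads after a boundary move follow from the per-level
counts alone, which is what makes the effect of an exchange easy to evaluate in advance.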

To achieve a true speedup within this horizontal-partitioning Saturation framework, [12] employs
speculative computation, by using workstations' idle time to perform potentially useful firings. More
precisely, we compute the relational product of a node p at level k in the L-level MDD of the current
$\mathcal{X}_{rch}$ and a node r at level k of the 2L-level MDD encoding $\mathcal{N}_\alpha$, where
$Top(\alpha) > k$. If, later on, the firing of $\alpha$ on a node at level $Top(\alpha)$ reaches node p,
the speculation pays off, as we simply retrieve the result (computed using idle workstation time!) from
the cache. However, excessive speculation can easily lead to memory overload, as the unique table and
the operation cache may end up containing many useless entries that will not be needed in the future.
To reduce this problem, [13] associates a firing pattern (the set of events non-speculatively fired on
a node so far) to each MDD node, and computes a score for each speculation to reflect how likely the
speculation of an event on a node is to move it toward another pattern. Furthermore, the score can take
into account pattern popularity, i.e., how many nodes have a particular pattern. With respect to [12],
the results in [13] show how patterns can be used to avoid excessive speculation, so that, when
speculation does not help speed up the computation, at least it does not harm memory much. Overall, a
speedup of up to a factor of two is observed in many models using N = 8 workstations, with a moderate
increase in memory consumption (i.e., a workstation might use up to 1.9/8 of the memory required when
N = 1, while, without speculation, the horizontal partitioning uses only a little over 1/8 of the memory
required when N = 1, and thus achieves almost perfect memory balance and no memory overhead, but no
speedup at all, actually a slowdown due to communication).



3.2 Shared-memory approaches for symbolic state-space generation 

With shared memory, the main concern shifts from the communication overhead of coarse-grained
processes to the locking and mutual exclusion requirements of fine-grained processes. Consider a call
RelProd(p, r), that is, a relational product call reaching nodes p and r, both associated with variable
$x_k$, as shown in Fig. 7. The recursion will issue the calls RelProd(p[0], r[0][0]), RelProd(p[0], r[0][1]),
RelProd(p[1], r[1][2]), RelProd(p[2], r[2][1]), RelProd(p[2], r[2][2]), and RelProd(p[3], r[3][3]). These
can be issued in any order; indeed, they can be run in parallel, if enough cores or processors are available.

Figure 7: Potential fine-grained parallelism in a relational product computation (left: L-level MDD
encoding a set; right: 2L-level MDD encoding a next-state function).

However, DDs are not trees: they hopefully contain many recombining paths. Thus, multiple recursive
calls may reach the same argument pairs. In a sequential approach, we use the cache to avoid repeating
computations. In a shared-memory approach, in addition, we need to use some locking mechanism
to avoid all redundant computations; one possible approach is for a process to first record its intention
to build a result, by immediately inserting in the cache a dummy value before initiating a
computation, to be substituted by the actual result value later, once it has been computed. Concurrent
cache lookups by other processes needing the same result retrieve either this dummy value from the
cache (the processes will know that the value is already being computed and will be available soon) or
the actual result (as in the sequential case). Of course, processes must issue locks on the unique table
and the operation cache; we can avoid excessive serialization by partitioning the unique table and cache
(for example by levels), so that the unit of memory being locked is finer, reducing the blocking probability.
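
The dummy-value protocol can be sketched with Python threads. This is one possible realization under
assumptions made here, not the implementation discussed in the paper: a concurrent.futures.Future object
serves as the dummy value, the cache is partitioned by level with one lock per level, and the class and
method names are hypothetical.

```python
import threading
import time
from concurrent.futures import Future
from typing import Any, Callable, Dict, Hashable, List


class LevelPartitionedCache:
    """Operation cache with one lock and one table per level."""

    def __init__(self, num_levels: int) -> None:
        self._locks = [threading.Lock() for _ in range(num_levels + 1)]
        self._tables: List[Dict[Hashable, Future]] = [{} for _ in range(num_levels + 1)]

    def get_or_compute(self, level: int, key: Hashable, compute: Callable[[], Any]) -> Any:
        with self._locks[level]:              # lock only the table of this level
            entry = self._tables[level].get(key)
            is_owner = entry is None
            if is_owner:
                entry = Future()              # dummy value: "result is being computed"
                self._tables[level][key] = entry
        if is_owner:
            try:
                entry.set_result(compute())   # the real work happens outside the lock
            except BaseException as exc:
                entry.set_exception(exc)
                raise
        return entry.result()                 # waits if another thread is still computing


compute_calls: List[int] = []


def slow_result() -> int:
    compute_calls.append(1)                   # count how many times the work is really done
    time.sleep(0.05)
    return 49


cache = LevelPartitionedCache(num_levels=3)
results: List[int] = []


def lookup() -> None:
    results.append(cache.get_or_compute(2, ("p", "r"), slow_result))


workers = [threading.Thread(target=lookup) for _ in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
```

Four threads request the same argument pair, but only the first one to reach the cache performs the
computation; the others find the dummy value and wait for it to be replaced by the result.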

This works, and can indeed achieve reasonable speedup for symbolic breadth-first state-space
generation. However, since Saturation tends to be enormously more efficient than breadth-first
iterations, we should apply this approach to parallelize Saturation. Ironically, exploring this
opportunity led us to first further speed up sequential Saturation [15]. One of the reasons for the
efficiency of Saturation is its extensive use of chaining [39]: if events $\alpha$ and $\beta$ can be
fired on the set of states $\mathcal{X}$,

$$\mathcal{N}_\beta(\mathcal{N}_\alpha(\mathcal{X}) \cup \mathcal{X}) \cup \mathcal{N}_\alpha(\mathcal{X}) \cup \mathcal{X} \;\supseteq\; \mathcal{N}_\beta(\mathcal{X}) \cup \mathcal{N}_\alpha(\mathcal{X}) \cup \mathcal{X}$$

(for the inclusion to be strict, $\alpha$ must add new states that enable $\beta$). Chaining was proposed
as a heuristic that looks at the system structure (Petri net, circuit) to derive a good event order, so
that firings "help compound each other's effect".

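The benefit of chaining can be seen on a tiny explicit example, where Python sets stand in for the
symbolic encodings. The two events below are made up for illustration: alpha is enabled only in state 0
and beta only in state 1, so beta becomes useful only after alpha has fired.

```python
from typing import Callable, Set

Event = Callable[[int], Set[int]]


def image(event: Event, states: Set[int]) -> Set[int]:
    """States reachable in one firing of `event` from any state in `states`."""
    return {t for s in states for t in event(s)}


def alpha(s: int) -> Set[int]:
    return {1} if s == 0 else set()   # enabled only in state 0


def beta(s: int) -> Set[int]:
    return {2} if s == 1 else set()   # enabled only in state 1


X = {0}
unchained = X | image(alpha, X) | image(beta, X)    # both events fired on X
after_alpha = X | image(alpha, X)
chained = after_alpha | image(beta, after_alpha)    # beta fired on the result of alpha
```

Here the unchained step finds only states 0 and 1, while the chained step also finds state 2 in the same
pass, because alpha adds a state that enables beta.
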
In [15], we applied this idea by considering not the structure of the high-level model, but of the MDD
itself, when performing a relational product. Let $r_k$ be the MDD encoding
$\bigcup_{\alpha : Top(\alpha) = k} \mathcal{N}_\alpha$. To saturate p at level k, we build its dynamic
transition graph

$$G_p = (\mathcal{X}_k, \mathcal{T}_p) \quad \text{where} \quad \mathcal{T}_p = \{(i, j) \in \mathcal{X}_k \times \mathcal{X}_k : p[i] \neq \mathbf{0} \wedge r_k[i][j] \neq \mathbf{0} \wedge p[j] \neq \mathbf{1}\}.$$

If the dynamic transition graph has a path from i to j but not from j to i, then we should not issue the
call RelProd(p[j], $r_k$[j][·]) until the calls RelProd(p[·], $r_k$[·][j]) have converged (i.e., they
cannot add more states). This is not a heuristic, it is guaranteed to be optimal.
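
The ordering constraint can be made concrete with a small sketch that builds the dynamic transition
graph and tests reachability between local states. The encoding is an assumption made for illustration:
children are opaque labels, with 0 standing for the empty child and 1 for the full one, and the six
(i, j) pairs are those of the RelProd calls listed for Fig. 7.

```python
from typing import Dict, List, Set, Tuple

EMPTY, FULL = 0, 1  # stand-ins for the "empty" and "full" children


def dynamic_transition_graph(p: List[object], r: List[List[object]]) -> Set[Tuple[int, int]]:
    """Edges (i, j) with p[i] non-empty, r[i][j] non-empty, and p[j] not full."""
    n = len(p)
    return {
        (i, j)
        for i in range(n)
        for j in range(n)
        if p[i] != EMPTY and r[i][j] != EMPTY and p[j] != FULL
    }


def reachability(edges: Set[Tuple[int, int]], n: int) -> Dict[int, Set[int]]:
    """For each local state, the local states reachable through one or more edges."""
    succ: Dict[int, Set[int]] = {i: set() for i in range(n)}
    for i, j in edges:
        succ[i].add(j)
    reach: Dict[int, Set[int]] = {}
    for start in range(n):
        seen: Set[int] = set()
        stack = [start]
        while stack:
            for v in succ[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        reach[start] = seen
    return reach


def must_precede(reach: Dict[int, Set[int]], i: int, j: int) -> bool:
    """True if firings out of i should converge before firing out of j."""
    return j in reach[i] and i not in reach[j]


p = ["a", "b", "c", "d"]                       # made-up child labels, none empty or full
r = [[EMPTY] * 4 for _ in range(4)]
for i, j in [(0, 0), (0, 1), (1, 2), (2, 1), (2, 2), (3, 3)]:
    r[i][j] = "n"                              # made-up label for a non-empty child of r
reach = reachability(dynamic_transition_graph(p, r), len(p))
```

In this example local state 0 must precede 1 and 2, local states 1 and 2 lie on a cycle, so neither
strictly precedes the other, and local state 3 is unrelated to the rest.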

Figure 8: The dynamic firing graph in a relational product computation.

Unfortunately, this observation does not suffice to provide us with a total order on the firings, since
the dynamic transition graph may contain strongly-connected components. To "break cycles", we define
the fullness of node p as $\phi(p)$ = "number of paths encoded by p" $/ \, |\mathcal{X}_k \times \cdots \times \mathcal{X}_1|$;
then, under a uniform distribution assumption, adding the result of the call RelProd(p[i], $r_k$[i][j])
increases $\phi(p[j])$ by

$$\Delta \approx |\mathcal{X}_k \times \cdots \times \mathcal{X}_1| \cdot \phi(p[i]) \cdot \phi(r_k[i][j]) \cdot (1 - \phi(p[j]))$$

in expectation. Thus, we call first RelProd(p[i], $r_k$[i][j]) for the pair (i, j) maximizing $\Delta$.
This heuristic was shown to work quite well in practice, resulting in consistently better run time (up
to 4×) and memory (up to 3×) than previous (sequential) implementations of Saturation. Furthermore, the
overhead to maintain fullness data and to build and update $G_p$ when saturating p was shown to be
negligible.
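
The selection rule can be sketched in a few lines. The sketch takes the estimate above to be the product
of the domain size, the fullness of p[i], the fullness of the child of r reached through (i, j), and one
minus the fullness of p[j]; the function names and the fullness values in the example are made up.

```python
from fractions import Fraction
from typing import Dict, Iterable, Tuple

Edge = Tuple[int, int]


def expected_gain(domain_size: int, phi_pi: Fraction, phi_rij: Fraction, phi_pj: Fraction) -> Fraction:
    """Estimated gain for one call, following the product form described above."""
    return domain_size * phi_pi * phi_rij * (1 - phi_pj)


def pick_next_call(
    edges: Iterable[Edge],
    phi_p: Dict[int, Fraction],
    phi_r: Dict[Edge, Fraction],
    domain_size: int,
) -> Edge:
    """Return the pair (i, j) with the largest estimated gain (ties broken by sorted order)."""
    return max(
        sorted(edges),
        key=lambda e: expected_gain(domain_size, phi_p[e[0]], phi_r[e], phi_p[e[1]]),
    )


# Two local states on a cycle: which call should be issued first?
phi_p = {1: Fraction(1, 2), 2: Fraction(1, 4)}
phi_r = {(1, 2): Fraction(1, 8), (2, 1): Fraction(1, 8)}
best = pick_next_call({(1, 2), (2, 1)}, phi_p, phi_r, domain_size=16)
```

The call from local state 1 to local state 2 is preferred: its source child is fuller and its target
child is emptier, so it is expected to contribute more new paths.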

Unfortunately (from a parallelization perspective), this result improves sequential Saturation by
imposing a strict order on the RelProd calls. This fine-grained chaining heuristic can help us understand
what happens when parallelizing Saturation. If $G_p$ contains no path from i to j and no path from j to i,
we can perform some firings in parallel, out of i and out of j, for example (of course, we must use
fine-grained locks, since the DD is a DAG, not a tree). If $G_p$ contains a path from i to j but not from
j to i, we know that we should perform the firings from i on the path to j before firing any event on j,
otherwise we may hurt chaining. If $G_p$ contains a path from i to j and a path from j to i, we know we
need to break the cycle (even in the sequential case) and may hurt chaining, but parallelization can
hurt chaining further.

This was experimentally verified in a Cilk [6] implementation at York University [22] running on a
shared-memory multicore computer system. We achieved some speedup on some models on
a four-core machine, but also experienced substantial slowdowns on many models, when parallelization
hurts chaining. Thus, this approach is likely not scalable in the number of cores for practical models.

One intriguing possibility that needs further investigation is that we can obviously fire in parallel on
different nodes at the same level. However, it remains to be seen how often this situation can be exploited
in practice using Saturation, especially in the common case where the initial state set
$\mathcal{X}_{init}$ contains a single state.

4 Challenges and future goals 

As we argued at the beginning of this paper, achieving good speedups is the central goal for future work 
on parallel symbolic state-space generation and formal verification. From the above discussion, we can 
summarize the challenges in the following points:

• Finding an appropriate workload partitioning.

• Minimizing the synchronization overhead, especially due to global synchronizations.

• Devising efficient mutual exclusion schemes for DD operations.

The first point derives from the fact that DDs are brittle in time and memory consumption during their 
computation, so that balanced workloads across processes are hard to achieve and maintain. From previ- 
ous work, we can observe that distributed algorithms achieving good speedups are mainly asynchronous,
while those with global synchronizations are often not as competitive. On the shared-memory side, sym- 
bolic algorithms are memory intensive, and frequent accesses to lock-protected data can greatly reduce 
the potential parallelism. 

We believe that the parallelization of Saturation is still a promising, albeit challenging, line of work.
The main open question is how to find more tasks that can be executed in parallel to achieve true
speedup. The results reported in [22], which are still far from satisfactory, show that there is a subtle
trade-off between parallelism and the level-wise firing order of Saturation. Naive parallelization of
event firings will likely not lead to a faster algorithm, as the order of firing has an enormous impact
on performance. Rather, exploring how to extract all the possible parallelism, both at a coarse and at a
fine granularity, while respecting the partial order of operations required by the Saturation approach,
is likely to offer the greatest payback. In addition, the speculative firing ideas used in [12, 13, 14]
might still be helpful to provide further parallelism, by exploiting idle processor or core time.